
model parallelism



Piper: Multidimensional Planner for DNN Parallelization

Neural Information Processing Systems

In the "modern era", such model-parallel training techniques trace their roots back to AlexNet [14] and early influential systems such as DistBelief [6] and Project Adam [3].




The rapid increase in sizes of state-of-the-art DNN models, and consequently the increase in the compute and memory requirements of model training, has led to the development of many execution schemes such as data parallelism, pipeline model parallelism, tensor (intra-layer) model parallelism, and various memory-saving optimizations. However, no prior work has tackled the highly complex problem of optimally partitioning the DNN computation graph across many accelerators while combining all these parallelism modes and optimizations. In this work, we introduce Piper, an efficient optimization algorithm for this problem that is based on a two-level dynamic programming approach. Our two-level approach is driven by the insight that being given tensor-parallelization techniques for individual layers (e.g., Megatron-LM's splits for transformer layers) significantly reduces the search space and makes the global problem tractable, compared to considering tensor-parallel configurations for the entire DNN operator graph.
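To make the two-level idea concrete, here is a minimal sketch assuming a linear chain of layers with made-up per-layer costs, a small set of candidate tensor-parallel degrees, and a crude cost model; none of this is Piper's actual cost model or implementation. The outer dynamic program chooses contiguous pipeline stages, and the inner choice picks a tensor-parallel degree for each stage under a shared device budget, minimizing the slowest (bottleneck) stage.

```python
# Hedged sketch of a two-level dynamic program for partitioning a linear chain
# of layers into pipeline stages, choosing a tensor-parallel degree per stage.
# Layer costs, TP options, and the device budget are illustrative placeholders.

from functools import lru_cache

layer_cost = [4.0, 8.0, 8.0, 4.0, 2.0]   # hypothetical per-layer compute cost at TP=1
tp_options = [1, 2, 4]                   # candidate tensor-parallel degrees per stage
num_devices = 8                          # total accelerator budget

def stage_cost(lo, hi, tp):
    """Cost of running layers lo..hi-1 as one stage at tensor-parallel degree tp.
    Crude model: compute scales ~1/tp plus a fixed communication penalty."""
    compute = sum(layer_cost[lo:hi]) / tp
    comm = 0.5 * (tp - 1)
    return compute + comm

@lru_cache(maxsize=None)
def best(lo, devices_left):
    """Minimum bottleneck stage time for layers lo.. given 'devices_left' devices.
    Pipeline throughput is limited by the slowest stage, hence the max()."""
    if lo == len(layer_cost):
        return 0.0
    result = float("inf")
    for hi in range(lo + 1, len(layer_cost) + 1):    # outer DP: next stage boundary
        for tp in tp_options:                        # inner choice: TP degree for stage
            if tp <= devices_left:
                cand = max(stage_cost(lo, hi, tp), best(hi, devices_left - tp))
                result = min(result, cand)
    return result

if __name__ == "__main__":
    print(f"best bottleneck stage time: {best(0, num_devices):.2f}")
```

The real problem Piper addresses also folds in the memory-saving optimizations mentioned in the abstract; the sketch only illustrates how the two decision levels nest.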


ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Yuran Ding, Xinwei Chen, Xiaofan Zhang, Zongwei Zhou

arXiv.org Artificial Intelligence

Optimizing large-language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to its complex optimization space. Existing optimization methods, however, rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. It is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, which integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized sharding configurations with reasoning, thus effectively improving the efficiency of distributed LLM training. Experiments have shown that the ASAP-generated sharding configurations can contribute up to 28% training step time reduction and 1.43 times throughput improvement. When combined with additional optimization from human experts, throughput can be further increased to 2.58 times. The proposed ASAP promises to provide a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training.
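As a rough illustration of how such a loop could be wired together, the sketch below hard-codes a Coordinator/Analyzer/Proposal cycle over a toy profile and sharding configuration. The dataclasses, thresholds, and heuristics are hypothetical stand-ins for the paper's LLM-driven agents, profiling tools, roofline analysis, and knowledge base, not ASAP's actual interfaces.

```python
# Hedged sketch of a coordinator/analyzer/proposal loop in the spirit of ASAP.
# All names, fields, and heuristics below are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class Profile:
    step_time_s: float      # measured training step time
    mfu: float              # model FLOPs utilization (0..1)
    comm_fraction: float    # share of step time spent in collectives

@dataclass
class ShardingConfig:
    data_parallel: int
    tensor_parallel: int
    pipeline_parallel: int

def analyzer(profile: Profile) -> str:
    """Turn profiling numbers into a coarse bottleneck diagnosis
    (stand-in for LLM reasoning over profiler and roofline output)."""
    if profile.comm_fraction > 0.3:
        return "communication-bound"
    if profile.mfu < 0.35:
        return "compute-underutilized"
    return "balanced"

def proposal(diagnosis: str, cfg: ShardingConfig) -> ShardingConfig:
    """Suggest a new sharding configuration for the diagnosed bottleneck
    (stand-in for retrieving best practices from a knowledge base)."""
    if diagnosis == "communication-bound" and cfg.tensor_parallel > 1:
        return ShardingConfig(cfg.data_parallel * 2,
                              cfg.tensor_parallel // 2,
                              cfg.pipeline_parallel)
    if diagnosis == "compute-underutilized":
        return ShardingConfig(cfg.data_parallel,
                              cfg.tensor_parallel,
                              max(1, cfg.pipeline_parallel // 2))
    return cfg

def coordinator(profile: Profile, cfg: ShardingConfig) -> ShardingConfig:
    """Coordinator: route the profile through analysis, then request a proposal."""
    return proposal(analyzer(profile), cfg)

if __name__ == "__main__":
    before = ShardingConfig(data_parallel=8, tensor_parallel=8, pipeline_parallel=4)
    prof = Profile(step_time_s=12.4, mfu=0.31, comm_fraction=0.38)
    print(coordinator(prof, before))
```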


Supplementary material of LoCo: Local Contrastive Representation Learning

Neural Information Processing Systems

In this section we show the block structure of each stage in Progressive ResNet-50 in Table 1. The results are shown in Figure 2; we can see that LoCo learns image embeddings. Last, we show qualitative results of detection and instance segmentation tasks on COCO in Figure 1.


ffe10334251de1dc98339d99ae4743ba-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their thoughtful comments. But consider the case of training BERT on a TPU pod, which takes around 4 days. We provide a formalization of the problem with rigorous guarantees. We now address a few of the specific reviewer concerns. However, in the revised version of this paper we will include a more thorough discussion of this. That post draws on Courcelle's theorem (namely, every graph property definable in the monadic second-order ...). We feel that it's more accurate to avoid ...
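For reference, the theorem alluded to in the truncated sentence above has the following standard statement (general background, not text recovered from the author feedback):

```latex
% Standard statement of Courcelle's theorem, included only for reference;
% this is not text from the author feedback itself.
\begin{theorem}[Courcelle]
  Let $\varphi$ be a graph property definable in monadic second-order logic
  and let $k \in \mathbb{N}$ be fixed. There is an algorithm that, given a
  graph $G$ of treewidth at most $k$, decides whether $G$ satisfies $\varphi$
  in time $O(|V(G)| + |E(G)|)$.
\end{theorem}
```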